RepresentationSeries: pca, nmf, tsne #140

Conversation

henrifroese

  • Implement full support for RepresentationSeries in the "Dimensionality Reduction" functions of the representation module
  • Add appropriate (parameterized) tests in `test_representation.py`

Note: the dimensionality reduction functions do support both flat and representational input. This is because, as discussed in #134, we will not implement a `to_repr` function, so users might have e.g. a flattened tfidf Series they want to perform pca on -> we allow them to use the dimensionality reduction functions. Of course, we could remove the flat support (if we basically want to "forbid" users from using `VectorSeries` for tfidf/term_freq/count); we're not 100% sure.

…ion" funcionts of representaion module

- Implement full support for Representation Series in "Dimensionality Reduction" functions of representation module
- add appropriate parameterized tests in `test_representation.py`

Note: the dimensionality reduction functions *do support* both flat and represenational input; this is because as discussed in jbesomi#134 we will not implement a `to_repr` function, so users might have e.g. a flattened tfidf series they want to perform pca on -> they can use the function. Of course, we could remove the flat support (if we basically want to "forbid" users from using `VectorSeries` for tfidf/term_freq/count; we're not 100% sure.

Co-authored-by: Maximilian Krahn <[email protected]>

jbesomi commented Aug 5, 2020

🎉

Opinions:

  1. I would stick to the rule "One function - one single input HeroSeries type"

Review:

  1. The output of both pca and nmf makes sense as a VectorSeries.
  2. pca, by its algorithm definition, requires a dense array as input (as the data are normalized). The input, in this case, should be a VectorSeries.
    --> We will need to explain clearly in the documentation that it is strongly advised to limit the length of the vector (i.e. the number of features in tfidf).
    --> Pseudocode: `s.tfidf(max_features=200).flatten().pca()` ...
    --> We need to check that the input is a VectorSeries. If it has a MultiIndex, we need to explain that the Series should be flattened first (see the sketch after this list).
  3. nmf works with a sparse matrix. In this case, it would be inefficient not to work with a RepresentationSeries as input.
    --> In getting started/representation we will need to explain the difference between the two approaches and which approach (pca or nmf) is preferred and when.
  4. tsne: TODO, not sure if it uses dense or sparse input ...
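
For illustration, a minimal sketch of that input check (the helper name and error message are assumptions, not texthero's actual code):

```python
import pandas as pd

def check_vector_series(s: pd.Series) -> None:
    # Hypothetical guard: a RepresentationSeries carries a MultiIndex,
    # so a MultiIndex here means the user still needs to flatten.
    if isinstance(s.index, pd.MultiIndex):
        raise TypeError(
            "pca expects a VectorSeries; this Series has a MultiIndex. "
            "Flatten it first, e.g. s.pipe(hero.flatten)."
        )
```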

Hope you got some insights! 👍

@henrifroese changed the title from "RepresentationSeries: pca, nmf" to "RepresentationSeries: pca, nmf, tsne" on Aug 6, 2020

henrifroese commented Aug 6, 2020

I believe that it is a bad idea to restrict the pca input to VectorSeries. Users will almost always first apply one of count, term_frequency, tfidf, so they have a RepresentationSeries of their documents. They then want to do dimensionality reduction. If pca takes VectorSeries input, and nmf, tsne take RepresentationSeries input, we have very non-user-friendly behaviour that goes against the goal of texthero:

df["pca"] = df["texts"].pipe(hero.tfidf).pipe(hero.flatten).pipe(hero.pca)
df["nmf"] = df["texts"].pipe(hero.tfidf).pipe(hero.nmf)

So two functions that both do dimensionality reduction need a different pipeline, that's not nice for users!

Yes, pca in sklearn does not handle sparse input internally. That is why, at line 492 in the code, we do: `s_for_vectorization = s_coo_matrix.todense()  # PCA cannot handle sparse input.`
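
For context, a minimal runnable sketch of that densification step (the RepresentationSeries layout and the `s_coo_matrix` / `s_for_vectorization` names mirror the quoted line; the rest is illustrative):

```python
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.decomposition import PCA

# Toy RepresentationSeries: MultiIndex of (document, term) -> weight.
s = pd.Series(
    [0.5, 0.8, 0.3, 0.9],
    index=pd.MultiIndex.from_tuples([(0, "cat"), (0, "dog"), (1, "cat"), (1, "fish")]),
)

rows = s.index.get_level_values(0).to_numpy()
cols, _ = pd.factorize(s.index.get_level_values(1))
s_coo_matrix = coo_matrix((s.to_numpy(), (rows, cols)))

# PCA cannot handle sparse input, so densify first
# (.toarray() here; the quoted line uses .todense()).
s_for_vectorization = s_coo_matrix.toarray()
values = PCA(n_components=2).fit_transform(s_for_vectorization)
```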

This allows the user to call all dimensionality reduction functions the same way. This is why I see two possibilities:

  1. Allow both RepresentationSeries and VectorSeries input for dimensionality reduction functions (but I understand you don't want to do this)
  2. Only allow RepresentationSeries input

I think it should stay an implementation detail and not be visible to users that pca cannot handle the sparseness internally (that's why we issue a warning if the input's dimensionality is too high; we can also add a comment about that in a tutorial). It's much nicer if the dim. red. functions are all called the same way. What do you think?


jbesomi commented Aug 6, 2020

Hey Henri, I totally agree with your view and understand your points!

I agree with having all of the dimensionality reduction functions (pca, nmf, tsne) receive a RepresentationSeries as input.

> I think it should stay an implementation detail and not be visible to users that pca cannot handle the sparseness internally (that's why we issue a warning if the input's dimensionality is too high; we can also add a comment about that in a tutorial). It's much nicer if the dim. red. functions are all called the same way. What do you think?

I believe the users should be very aware that pca cannot handle sparse input. We will have to make this very clear in the docstring (maybe citing an external source explaining why, or eventually we will write a blog post on that): pca will densify the Series. Going further, we will mention that the number of features should be limited (by saying to set max_features=X), and we will encourage using nmf in case the number of features is very large. One of Texthero's goals is also to teach new users, and having a rough idea of the advantages/disadvantages of the different dim. red. algorithms is for sure important.

Regarding the max_features default: an alternative might be to set the default max_features argument of the representation functions (tfidf, term_freq, ...) to 100, 200, or 300. Opinions? I don't have a clear position on that, but generally I like the idea of keeping (when it makes sense and is feasible) the same defaults as scikit-learn.
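
To make the trade-off concrete, a small scikit-learn-only sketch (toy corpus; 200 is one of the hypothetical values above, and scikit-learn's own default for max_features is None, i.e. an unbounded vocabulary):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the fish swam"]

unbounded = TfidfVectorizer()               # scikit-learn default: max_features=None
capped = TfidfVectorizer(max_features=200)  # the 100/200/300 alternative above

print(unbounded.fit_transform(corpus).shape)  # (3, full vocabulary size)
print(capped.fit_transform(corpus).shape)     # vocabulary capped at 200 terms
```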

@jbesomi marked this pull request as draft on August 7, 2020
@henrifroese

> I believe the users should be very aware that pca cannot handle sparse input. We will have to make this very clear in the docstring (maybe citing an external source explaining why, or eventually we will write a blog post on that): pca will densify the Series. Going further, we will mention that the number of features should be limited (by saying to set max_features=X), and we will encourage using nmf in case the number of features is very large. One of Texthero's goals is also to teach new users, and having a rough idea of the advantages/disadvantages of the different dim. red. algorithms is for sure important.

I will add something about that to the docstring! I agree that when writing a full tutorial on representation, we have to describe all this in detail; otherwise users will just use pca and wonder why it crashes their colab 🙃 .

> Regarding the max_features default: an alternative might be to set the default max_features argument of the representation functions (tfidf, term_freq, ...) to 100, 200, or 300. Opinions? I don't have a clear position on that, but generally I like the idea of keeping (when it makes sense and is feasible) the same defaults as scikit-learn.

I agree it's good to stick to the scikit-learn defaults and mention this in the tutorial (the representation tutorial will need to be very good 🤓 !)

- dim. red. functions only support Representation Series input
- appropriately change tests

henrifroese commented Aug 7, 2020

Just removed support for VectorSeries input in the dimensionality reduction functions. Should be ready to merge now.

QUESTION: About the clustering functions: should they:

  • (a) only support VectorSeries input (pro: they will probably mostly get called after dimensionality reduction -> VectorSeries as input)
  • (b) support both VectorSeries and RepresentationSeries input (pro: some users might want to call them straight after tfidf etc.; dbscan and kmeans also directly support sparse input, so users can efficiently work straight with a RepresentationSeries; see the sketch below)
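
For option (b), a quick scikit-learn-only check that clustering accepts sparse input directly (toy data, not texthero code):

```python
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# A tiny sparse document-term matrix: no densification needed for clustering.
X = csr_matrix([[1.0, 0.9, 0.0], [0.8, 1.0, 0.0], [0.0, 0.1, 1.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1]: the first two rows cluster together
```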


jbesomi commented Aug 8, 2020

Good question!

(a) Only VectorSeries: would be very inefficient in some cases ...

(b) We would not respect the rule "one function type for s" ... (but c'est la vie ... !)
--> And what if the dim. red. functions returned only RepresentationSeries and the clustering functions accepted only RepresentationSeries? How do you see this option? (Users would then use flatten to store the Series in the DataFrame.) The problem then would be if a user wants to re-extract a Series from a pandas DataFrame; in that case, they would receive a VectorSeries that would not work in any clustering function ... not the best idea ...

So, to recap right now we have:

- embedding functions (tfidf, count, ...): input TokenSeries, output RepresentationSeries
- dim. red. functions (pca, nmf, tsne): input RepresentationSeries, output VectorSeries
- clustering functions: input RepresentationSeries or VectorSeries, output ?

This might be a bit confusing/overcomplicated/limited, what's your advice on that?

An alternative might be: keep only RepresentationSeries and, if needed, use flatten to save into a DataFrame; but then can we extract a RepresentationSeries again?

Conclusion: I would like to see/understand your opinion! 👍


henrifroese commented Aug 9, 2020

I agree it's really annoying that we can't just only work with RepresentationSeries as most users will probably use DataFrames.

> An alternative might be: keep only RepresentationSeries and, if needed, use flatten to save into a DataFrame; but then can we extract a RepresentationSeries again?

I think even if that's possible, it's not really nice UX if users have to use `flatten` and `to_repr` often.

I think the possibilities are:

  1. Fully switch to RepresentationSeries, even for non-sparse output like that of dimensionality reduction and embed. This, however, destroys DataFrame compatibility.
  2. Forget the "one function, one input" rule (even though it's useful), and:
    1. let count, tfidf, term_frequency return RepresentationSeries
    2. let embed, pca, nmf, tsne return VectorSeries (as their output is never sparse)
    3. let pca, nmf, tsne, and all clustering functions handle both RepresentationSeries and VectorSeries input

I believe that possibility 2 is the most intuitive for users, as the only time they'll ever notice it is when trying to put a Series from count, tfidf, or term_frequency into a DataFrame; we'll clearly explain that they'll have to use flatten (see the sketch below).
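
To make that concrete, a hedged sketch of the UX under possibility 2 (function names follow the design discussed in this thread, including the flatten step; this is not a released texthero interface):

```python
import pandas as pd
import texthero as hero

df = pd.DataFrame({"texts": ["first document", "second document"]})
tokens = df["texts"].pipe(hero.clean).pipe(hero.tokenize)  # TokenSeries

# count/tfidf/term_frequency return a (sparse) RepresentationSeries,
# so it must be flattened before it fits into the DataFrame.
df["tfidf"] = tokens.pipe(hero.tfidf).pipe(hero.flatten)

# pca returns a dense VectorSeries, which fits into the DataFrame directly.
df["pca"] = tokens.pipe(hero.tfidf).pipe(hero.pca)
```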

@henrifroese

Closing this, see #156
